Fwd: Apple Darwin disabled fsync? - Mailing list pgsql-hackers
From | Peter Bierman |
---|---|
Subject | Fwd: Apple Darwin disabled fsync? |
Date | |
Msg-id | a06010200be3da9564694@[17.202.21.231] |
List | pgsql-hackers |
>Date: Sat, 19 Feb 2005 17:59:21 -0800
>From: Dominic Giampaolo <dbg@apple.com>
>Subject: Re: bad fsync? (A.M.)
>To: darwin-dev@lists.apple.com
>
>>MySQL makes the following claim at:
>>http://dev.mysql.com/doc/mysql/en/news-4-1-9.html
>>
>>"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
>>and up. Apple had disabled fsync() in Mac OS X for internal disk
>>drives, which caused corruption at power outages."
>>
>>First of all, is this accurate? A pointer to some docs or a tech note
>>on this would be helpful.
>>
>The comments about fsync() are wrong...
>
>On MacOS X, fsync() always has and always will flush all file data
>from host memory to the drive on which the file resides. The behavior
>of fsync() on MacOS X is the same as it is on every other version of
>Unix since the dawn of time (well, since the introduction of fsync
>anyway :-).
>
>I believe that what the above comment refers to is the fact that
>fsync() is not sufficient to guarantee that your data is on stable
>storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>to ask the drive to flush all buffered data to stable storage.
>
>Let me explain in more detail. With fsync() even though the OS
>writes the data through to the disk and the disk says "yes I wrote
>the data", the data is not actually on permanent storage. Unless
>you explicitly disable it, all disks have a write buffer which holds
>data you've written. The disk buffers the data you wrote until it
>decides to flush it to the platters (and the writes may not be in
>the order you wrote them). If you lose power or the system crashes
>before the data is written, you can wind up in a situation where only
>some of your data is actually on disk. What is worse is that even if
>you write blocks A, B and C, call fsync() and then write block D you
>may find after rebooting that blocks A and D are on disk but B and C
>are not (in fact any ordering of A, B, C, and D is possible).
>
>While this may seem like a rare case it is not. In fact if you sit
>down and pull the plug on a system you can make it happen in one or
>two plug pulls. I have even gone so far as to watch this behavior
>with a logic analyzer on the ATA bus: I saw the data for two writes
>come across the ATA cable, the drive replied and said the writes were
>successful and then when we rebooted the data from the second write
>was correct on disk but the data from the first write was not.
>
>To deal with this we introduced the F_FULLFSYNC fcntl which will ask
>the drive to flush all of its buffered data to disk. When an app
>needs to guarantee that data is on disk it should use F_FULLFSYNC.
>In most cases you do not need such a heavy handed operation and
>fsync() is good enough. But in an app like a database, it is
>essential if you want transactional integrity.
>
>Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC
>with the FLUSH_TRACK_CACHE command. All drives sold by Apple will
>honor this command. Unfortunately quite a few firewire drive vendors
>disable this command and do not pass it to the drive. This means that
>most external firewire drives are not reliable if you lose power or
>the system crashes. We can't work-around that unless we ask the drive
>to disable the write cache completely (which hurts performance quite
>badly -- and even that may not be enough as some drives will ignore
>that request too).
>
>So in summary, I believe that the comments in the MySQL news posting
>are slightly confused. On MacOS X fsync() behaves the same as it does
>on all Unices. That's not good enough if you really care about data
>integrity and so we also provide the F_FULLFSYNC fcntl. As far as I
>know, MacOS X is the only OS to provide this feature for apps that
>need to truly guarantee their data is on disk.
>
>Hope this clears things up.
>
>--dominic
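
For illustration, the distinction the message draws between fsync() and F_FULLFSYNC might look roughly like this in C. This is only a sketch: the helper name durable_flush() is hypothetical and not from the message, and the fallback to plain fsync() when the fcntl() request is rejected is an assumption of this example, not something the message prescribes. On platforms that do not define F_FULLFSYNC, the sketch reduces to fsync().

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is ours, not from the message):
 * push the data written to fd as close to stable storage as the
 * platform allows. Returns 0 on success, -1 on failure (errno set).
 */
int durable_flush(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: flush to the drive, then ask the drive to flush
       its own write buffer to permanent storage. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Assumption of this sketch: if the request is rejected (for
       example, the device does not support it), fall back to
       fsync() rather than fail outright. */
#endif
    /* Host memory -> drive only; the drive may still buffer it. */
    return fsync(fd);
}
```

A database-style caller would typically invoke such a helper after writing a commit record and before reporting the transaction as durable. As the message notes, this is a heavy operation, so ordinary applications may prefer plain fsync().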